
Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization

Neural Information Processing Systems

Recent voxel-based 3D object detectors for autonomous vehicles learn point cloud representations from either the bird's-eye view (BEV) or the range view (RV, a.k.a. the perspective view). However, each view has its own strengths and weaknesses. In this paper, we present a novel framework that unifies and leverages the benefits of both BEV and RV. The widely used cuboid-shaped voxels in the Cartesian coordinate system only benefit learning a BEV feature map. Therefore, to enable learning both BEV and RV feature maps, we introduce Hybrid-Cylindrical-Spherical voxelization. Our findings show that simply adding detection on the other view as auxiliary supervision leads to poor performance. We therefore propose a pair of cross-view transformers that map each feature map into the other view, and we introduce a cross-view consistency loss on them. Comprehensive experiments on the challenging NuScenes dataset validate the effectiveness of our proposed method by virtue of joint optimization and complementary information from both views. Remarkably, our approach achieves an mAP of 55.8%, outperforming all published approaches by at least 3% in overall performance and by up to 16.5% in safety-critical categories such as cyclists.
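The idea behind the hybrid voxelization is that cylindrical and spherical coordinates align voxel grids with the BEV and RV projections, respectively. As an illustrative sketch (the paper's exact partitioning and bin sizes are not given here, and the function names are our own), the underlying coordinate conversions look like this:

```python
import numpy as np

def cartesian_to_cylindrical(points):
    """Map (x, y, z) LiDAR points to cylindrical (rho, phi, z) coordinates.

    A regular grid over (rho, phi, z) keeps columns aligned with the
    ground plane, which suits a bird's-eye-view feature map.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x**2 + y**2)   # radial distance in the ground plane
    phi = np.arctan2(y, x)       # azimuth angle
    return np.stack([rho, phi, z], axis=1)

def cartesian_to_spherical(points):
    """Map (x, y, z) LiDAR points to spherical (r, theta, phi) coordinates.

    A regular grid over (theta, phi) follows the sensor's scan pattern,
    which suits a range-view (perspective) feature map.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)              # distance from sensor
    theta = np.arccos(z / np.maximum(r, 1e-9))   # polar (inclination) angle
    phi = np.arctan2(y, x)                       # azimuth angle
    return np.stack([r, theta, phi], axis=1)
```

Because both parameterizations share the azimuth axis `phi`, voxel features extracted in one coordinate system can be scattered into the other view without resampling along that axis.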


Review for NeurIPS paper: Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization

Neural Information Processing Systems

There are many small language mistakes, mostly in the technical section (Section 3), but they are not the main problem. The proposed method is simple (which, again, is a good thing), but it is nevertheless difficult to understand from the text. I try to detail below what could be changed to improve the clarity:

- Calling the mapping functions used in the constraint term "cross-view transformers" is confusing, as "transformer" means something else in deep learning (transformers in NLP, spatial transformers).
- Section 3.4 (about the transformers) mentions features, while in fact it is the final outputs that are "transformed".
- It is not said explicitly in Section 3.4 that the weights in Eq. (1) are learned.
- Eqs. (3) to (6) seem to use the Euclidean(?) norm, while the authors probably meant some similarity functions.
- Eq. (6) is disconnected from the text.
- Figure 1 is very dense and it is difficult to understand the method from it, while it should be possible to convey the method visually in a simple way.
- Mentioning the Hough transform to explain the method did not make the presentation more intuitive for me.
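The reviewer's point about Eqs. (3)-(6) is that a norm-based penalty and a similarity function imply different loss forms. A minimal sketch of a consistency term of the kind the paper describes (the `bev_to_rv` mapping here is a hypothetical stand-in for the paper's cross-view transformer, and the squared-Euclidean choice is one of the two readings the reviewer contrasts):

```python
import numpy as np

def consistency_loss(bev_pred, rv_pred, bev_to_rv):
    """Squared-Euclidean consistency between the BEV prediction mapped
    into range view and the native RV prediction.

    bev_to_rv is any differentiable view-mapping function; a similarity
    function (e.g. cosine) could be substituted for the squared error,
    which is the ambiguity the reviewer flags in Eqs. (3)-(6).
    """
    transformed = bev_to_rv(bev_pred)
    return np.mean((transformed - rv_pred) ** 2)
```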


Review for NeurIPS paper: Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization

Neural Information Processing Systems

The paper proposes a method for LiDAR-based object detection that exploits cross-view consistency between bird's-eye-view and range-view representations of the scene. The two inputs are fed to separate neural networks trained with a loss function that includes a term encouraging consistency between the two representations. Evaluations demonstrate strong performance compared to baselines on NuScenes. The paper was reviewed by four knowledgeable referees, who read the author response and subsequently discussed the paper. The reviewers agree that the manner in which the method exploits the bird's-eye and range views is interesting and elegant, namely the HCS voxel representation that enables feature extraction for both views and the manner in which the method enforces consistency on the transformed feature representations.
